There are three datasets for this project:
burglaries_2023.csv: Contains data on the aggravated burglary incidents in Davidson County. This was obtained from https://data.nashville.gov/Police/Metro-Nashville-Police-Department-Incidents/2u6v-ujjs. census.csv: Census tract level data on population and median income. This was obtained from the US Census American Community Survey. DC: A shapefile containing Davidson County census tracts Perform a spatial join to determine the census tract in which each burglary occurred. Hint: You may want to make use of the st_as_sf function in order to convert the burglaries data into an sf object.
After performing the spatial join, merge in the census data. Note: Make sure that the final dataset contains all census tracts.
library(sf)
library(tidyverse)
library(lubridate)
library(leaflet)
library(MASS)
burglaries <- read_csv("../data/burglaries_2023.csv")
Rows: 1204 Columns: 29── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (17): primary_key, report_type, report_type_description, incident_status_code, incident_status_description, investigation_status, incident_locat...
dbl (8): incident_number, latitude, longitude, zip_code, location_code, offense_number, offense_nibrs, victim_number
lgl (2): domestic_related, mapped_location
dttm (2): incident_occurred, incident_reported
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
census <- read_csv("../data/census.csv")
Rows: 174 Columns: 6── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): NAME, county, tract
dbl (3): state, population, median_income
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Check to see how many rows there are in census?
census %>%
nrow()
[1] 174
dc <- st_read("../data/DC/DC.shp")
Reading layer `DC' from data source `C:\Users\micha\Documents\NSS_Projects\geospatial-poisson-michaeltatarjr\data\DC\DC.shp' using driver `ESRI Shapefile'
Simple feature collection with 174 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.0547 ymin: 35.96778 xmax: -86.51559 ymax: 36.4055
Geodetic CRS: NAD83
Checking to see how many rows there are in dc census tract?
dc %>%
distinct(TRACTCE) %>%
nrow()
[1] 174
#convert burglaries into a geospatial object. Do this using the st_as_sf function.
burglaries_converted <- st_as_sf(burglaries |> drop_na(latitude),
coords = c('longitude', 'latitude'),
crs = st_crs(dc)
)
#Now, Perform a spatial join to determine the census tract in which each burglary occurred.
burglaries_combined <- st_join(dc, burglaries_converted, join = st_contains)
#After spatial join, checking to see how many rows there are…
burglaries_combined %>%
distinct(TRACTCE) %>%
nrow()
[1] 174
census %>%
head()
burglaries_combined<- merge(burglaries_combined, census, by.x = 'TRACTCE', by.y ='tract', all = TRUE)
burglaries_combined %>%
head(2)
Simple feature collection with 2 features and 44 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -86.91752 ymin: 36.29296 xmax: -86.80667 ymax: 36.39049
Geodetic CRS: NAD83
TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON primary_key incident_number
1 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230053195_11 20230053195
2 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230486382_11 20230486382
report_type report_type_description incident_status_code incident_status_description investigation_status incident_occurred incident_reported
1 D DISPATCHED O OPEN Open 2023-01-25 21:50:00 2023-01-26 00:47:00
2 D DISPATCHED O OPEN Open 2023-08-13 12:30:00 2023-08-17 14:22:00
incident_location zip_code location_code location_description offense_number offense_nibrs offense_description weapon_primary weapon_description
1 MORGAN RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed
2 CLAY LICK NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed
victim_number domestic_related victim_type victim_description mapped_location victim_gender victim_race victim_ethnicity victim_county_resident
1 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Hispanic RESIDENT
2 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Non-Hispanic RESIDENT
NAME.y state county population median_income geometry
1 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3...
2 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3...
| There appears to be the following: intplat=latitude intptlon=longitude Q’s that I may loo into as part of EDA… 1. why does the index follow an odd numbering schema? 2. Does countyfp or statefp ever change? 3. What is the distribution of times and days for incidents (need to convert date time…tried, but still have weird seconds thing…) 4. Various info on offense descriptions, weapons, victims, |
|---|
| Part 2 - Exploratory Analysis Perform some exploratory analysis on your prepared dataset. |
| Aggregate the data by census tract. Warning: each incident can appear multiple times if there are multiple victims, so be sure that you aren’t double-counting any incidents. |
| Which census tract had the highest number of burglaries? Which census tract had the highest number of burglaries per 1000 residents? |
| We’re interested in the relationship between median income and number of aggravated burglaries, so examine those variables on their own and together to see what you can find. You may want to perform additional calculations, create plots, etc. |
| Initial EDA |
r summary(burglaries_combined) |
| ``` TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT Length:1159 Length:1159 Length:1159 Length:1159 Length:1159 Length:1159 Length:1159 Length:1159 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character |
| ALAND AWATER INTPTLAT INTPTLON primary_key incident_number report_type Min. : 215341 Min. : 0 Length:1159 Length:1159 Length:1159 Min. :2.023e+10 Length:1159 1st Qu.: 1843898 1st Qu.: 0 Class :character Class :character Class :character 1st Qu.:2.023e+10 Class :character Median : 3250717 Median : 0 Mode :character Mode :character Mode :character Median :2.023e+10 Mode :character Mean : 5321199 Mean : 164169 Mean :2.023e+10 3rd Qu.: 6237959 3rd Qu.: 14553 3rd Qu.:2.023e+10 Max. :87771322 Max. :14968757 Max. :2.023e+10 NA’s :17 report_type_description incident_status_code incident_status_description investigation_status incident_occurred Length:1159 Length:1159 Length:1159 Length:1159 Min. :2023-01-01 06:00:00.0 Class :character Class :character Class :character Class :character 1st Qu.:2023-03-24 22:41:15.0 Mode :character Mode :character Mode :character Mode :character Median :2023-06-01 12:30:00.0 Mean :2023-05-31 22:06:08.2 3rd Qu.:2023-08-09 03:35:00.0 Max. :2023-10-21 18:00:00.0 NA’s :17 incident_reported incident_location zip_code location_code location_description offense_number offense_nibrs Min. :2023-01-01 19:25:00.00 Length:1159 Min. :37013 Min. : 6.0 Length:1159 Min. :1.000 Min. :220 1st Qu.:2023-03-26 19:11:45.00 Class :character 1st Qu.:37205 1st Qu.:22.0 Class :character 1st Qu.:1.000 1st Qu.:220 Median :2023-06-04 14:26:30.00 Mode :character Median :37208 Median :22.0 Mode :character Median :1.000 Median :220 Mean :2023-06-03 01:11:13.81 Mean :37182 Mean :43.7 Mean :1.102 Mean :220 3rd Qu.:2023-08-13 10:58:00.00 3rd Qu.:37211 3rd Qu.:90.0 3rd Qu.:1.000 3rd Qu.:220 Max. :2023-10-23 11:49:00.00 Max. :37228 Max. :90.0 Max. :5.000 Max. :220 NA’s :17 NA’s :922 NA’s :17 NA’s :17 NA’s :17 offense_description weapon_primary weapon_description victim_number domestic_related victim_type victim_description mapped_location Length:1159 Length:1159 Length:1159 Min. : 1.000 Mode :logical Length:1159 Length:1159 Mode:logical Class :character Class :character Class :character 1st Qu.: 1.000 FALSE:1029 Class :character Class :character NA’s:1159 Mode :character Mode :character Mode :character Median : 1.000 TRUE :113 Mode :character Mode :character Mean : 1.583 NA’s :17 3rd Qu.: 1.000 Max. :23.000 NA’s :17 victim_gender victim_race victim_ethnicity victim_county_resident NAME.y state county population Length:1159 Length:1159 Length:1159 Length:1159 Length:1159 Min. :47 Length:1159 Min. : 0 Class :character Class :character Class :character Class :character Class :character 1st Qu.:47 Class :character 1st Qu.:2876 Mode :character Mode :character Mode :character Mode :character Mode :character Median :47 Mode :character Median :3828 Mean :47 Mean :4111 3rd Qu.:47 3rd Qu.:5194 Max. :47 Max. :9896 |
| median_income geometry Min. :-666666666 MULTIPOLYGON :1159 1st Qu.: 40425 epsg:4269 : 0 Median : 52306 +proj=long…: 0 Mean : -5122846 3rd Qu.: 62196 Max. : 199643 |
| ``` |
r burglaries_combined %>% head(2) |
Simple feature collection with 2 features and 44 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -86.91752 ymin: 36.29296 xmax: -86.80667 ymax: 36.39049 Geodetic CRS: NAD83 TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON primary_key incident_number 1 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230053195_11 20230053195 2 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230486382_11 20230486382 report_type report_type_description incident_status_code incident_status_description investigation_status incident_occurred incident_reported 1 D DISPATCHED O OPEN Open 2023-01-25 21:50:00 2023-01-26 00:47:00 2 D DISPATCHED O OPEN Open 2023-08-13 12:30:00 2023-08-17 14:22:00 incident_location zip_code location_code location_description offense_number offense_nibrs offense_description weapon_primary weapon_description 1 MORGAN RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed 2 CLAY LICK NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed victim_number domestic_related victim_type victim_description mapped_location victim_gender victim_race victim_ethnicity victim_county_resident 1 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Hispanic RESIDENT 2 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Non-Hispanic RESIDENT NAME.y state county population median_income geometry 1 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... 2 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... |
r burglaries_combined %>% tail(2) |
Simple feature collection with 2 features and 44 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -86.77107 ymin: 36.08151 xmax: -86.75064 ymax: 36.1124 Geodetic CRS: NAD83 TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON primary_key incident_number 1158 980200 47 037 47037980200 9802 Census Tract 9802 G5020 S 3750174 0 +36.0995268 -086.7586124 20230352915_11 20230352915 1159 980200 47 037 47037980200 9802 Census Tract 9802 G5020 S 3750174 0 +36.0995268 -086.7586124 20230393116_11 20230393116 report_type report_type_description incident_status_code incident_status_description investigation_status incident_occurred incident_reported 1158 D DISPATCHED O OPEN Open 2023-06-16 08:00:00 2023-06-16 08:00:00 1159 D DISPATCHED O OPEN Open 2023-07-05 04:00:00 2023-07-05 05:25:00 incident_location zip_code location_code location_description offense_number offense_nibrs offense_description weapon_primary 1158 MELPARK DR NA 90 APARTMENT 1 220 BURGLARY- AGGRAVATED 17 1159 FRANKLIN PKE NA 90 APARTMENT 1 220 BURGLARY- AGGRAVATED 06 weapon_description victim_number domestic_related victim_type victim_description mapped_location victim_gender victim_race 1158 NONE 1 FALSE I INDIVIDUAL (18 AND OVER) NA F W 1159 LETHAL/CUTTING INSTRUMENT 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W victim_ethnicity victim_county_resident NAME.y state county population median_income 1158 Non-Hispanic RESIDENT Census Tract 9802, Davidson County, Tennessee 47 037 0 -666666666 1159 Hispanic RESIDENT Census Tract 9802, Davidson County, Tennessee 47 037 0 -666666666 geometry 1158 MULTIPOLYGON (((-86.77105 3... 1159 MULTIPOLYGON (((-86.77105 3... |
| We see here that there are 174 census tracts and they are connected to geometry… |
r census_by_tract<-burglaries_combined |> count(TRACTCE) census_by_tract |
Simple feature collection with 174 features and 2 fields Geometry type: GEOMETRY Dimension: XY Bounding box: xmin: -87.0547 ymin: 35.96779 xmax: -86.51559 ymax: 36.4055 Geodetic CRS: NAD83 First 10 features: TRACTCE n geometry 1 010103 2 POLYGON ((-86.91752 36.3397... 2 010104 5 POLYGON ((-86.97461 36.2493... 3 010105 4 POLYGON ((-86.89111 36.2620... 4 010106 3 POLYGON ((-86.83087 36.2655... 5 010201 6 POLYGON ((-86.81737 36.2738... 6 010202 1 POLYGON ((-86.82483 36.3321... 7 010301 1 POLYGON ((-86.7415 36.29325... 8 010302 1 POLYGON ((-86.72457 36.2967... 9 010303 1 POLYGON ((-86.71941 36.2925... 10 010401 5 POLYGON ((-86.71147 36.2729... |
| Note a few things about the answer above “NAD 83 provides a frame of reference for latitude and longitude locations on Earth. Surveyors now rely almost exclusively on the Global Positioning System (GPS) to identify locations on the Earth and incorporate them into existing geodetic datums. For example, NAD27, NAD83, and WGS84 are the most common geodetic datums in North America.” |
| Count by time. (How does time work in R? Do we have to do dt calcs?) |
r count_by_time<-burglaries_combined |> count(incident_occurred) count_by_time |
Simple feature collection with 883 features and 2 fields Geometry type: GEOMETRY Dimension: XY Bounding box: xmin: -87.0547 ymin: 35.96779 xmax: -86.51559 ymax: 36.4055 Geodetic CRS: NAD83 First 10 features: incident_occurred n geometry 1 2023-01-01 06:00:00 2 POLYGON ((-86.7497 36.18066... 2 2023-01-01 15:00:00 1 POLYGON ((-86.7515 36.11124... 3 2023-01-01 16:00:00 1 POLYGON ((-87.00796 36.1195... 4 2023-01-02 01:36:00 1 POLYGON ((-86.71499 36.0808... 5 2023-01-02 03:00:00 2 POLYGON ((-86.75783 36.0835... 6 2023-01-02 03:26:00 1 POLYGON ((-86.77507 36.1491... 7 2023-01-02 04:00:00 1 POLYGON ((-86.77507 36.1491... 8 2023-01-02 19:18:00 1 POLYGON ((-86.8216 36.22223... 9 2023-01-02 21:40:00 1 POLYGON ((-86.78276 36.1909... 10 2023-01-03 06:00:00 1 POLYGON ((-86.66638 36.1035... |
r burglaries_combined <- burglaries_combined |> mutate(incident_difference = incident_reported - incident_occurred) burglaries_combined |
Simple feature collection with 1159 features and 45 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -87.0547 ymin: 35.96778 xmax: -86.51559 ymax: 36.4055 Geodetic CRS: NAD83 First 10 features: TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON primary_key 1 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230053195_11 2 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230486382_11 3 010104 47 037 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 20230029138_11 4 010104 47 037 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 20230073829_31 5 010104 47 037 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 20230112598_21 6 010104 47 037 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 20230260489_11 7 010104 47 037 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 20230552785_11 8 010105 47 037 47037010105 101.05 Census Tract 101.05 G5020 S 28328799 1093 +36.2504208 -086.8521501 20230022927_11 9 010105 47 037 47037010105 101.05 Census Tract 101.05 G5020 S 28328799 1093 +36.2504208 -086.8521501 20230222174_11 10 010105 47 037 47037010105 101.05 Census Tract 101.05 G5020 S 28328799 1093 +36.2504208 -086.8521501 20230460999_11 incident_number report_type report_type_description incident_status_code incident_status_description investigation_status incident_occurred 1 20230053195 D DISPATCHED O OPEN Open 2023-01-25 21:50:00 2 20230486382 D DISPATCHED O OPEN Open 2023-08-13 12:30:00 3 20230029138 D DISPATCHED O OPEN Open 2023-01-14 01:00:00 4 20230073829 D DISPATCHED O OPEN Open 2023-02-04 12:00:00 5 20230112598 D DISPATCHED A CLEARED BY ARREST Closed 2023-02-17 16:00:00 6 20230260489 D DISPATCHED O OPEN Open 2023-05-03 17:46:00 7 20230552785 D DISPATCHED O OPEN Open 2023-09-16 21:00:00 8 20230022927 D DISPATCHED O OPEN Open 2023-01-11 20:45:00 9 20230222174 D DISPATCHED O OPEN Open 2023-04-15 20:05:00 10 20230460999 D DISPATCHED O OPEN Open 2023-08-06 08:00:00 incident_reported incident_location zip_code location_code location_description offense_number offense_nibrs offense_description weapon_primary 1 2023-01-26 00:47:00 MORGAN RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 2 2023-08-17 14:22:00 CLAY LICK NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 3 2023-01-14 17:15:00 OLD CLARKSVILLE NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 09 4 2023-02-04 13:47:00 OLD HICKORY BLVD NA 22 RESIDENCE, HOME 3 220 BURGLARY - AGGRAVATED 17 5 2023-02-22 14:07:00 6503 6503 37080 22 RESIDENCE, HOME 2 220 BURGLARY- AGGRAVATED 17 6 2023-05-03 23:11:00 UNION HILL RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 09 7 2023-09-17 00:06:00 CARNEY RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 17 8 2023-01-11 22:35:00 TUCKER RD NA 22 RESIDENCE, HOME 1 220 BURGLARY - AGGRAVATED 09 9 2023-04-15 21:58:00 DRY FORK RD NA 22 RESIDENCE, HOME 1 220 BURGLARY - AGGRAVATED 09 10 2023-08-06 19:23:00 STEVENS LN NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 09 weapon_description victim_number domestic_related victim_type victim_description mapped_location victim_gender victim_race victim_ethnicity 1 Unarmed 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Hispanic 2 Unarmed 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Non-Hispanic 3 PERSONAL (HANDS) 1 FALSE I INDIVIDUAL (18 AND OVER) NA F W Non-Hispanic 4 NONE 1 TRUE I INDIVIDUAL (18 AND OVER) NA F W Non-Hispanic 5 NONE 1 FALSE I INDIVIDUAL (18 AND OVER) NA F W Non-Hispanic 6 PERSONAL (HANDS) 1 FALSE I INDIVIDUAL (18 AND OVER) NA F B Non-Hispanic 7 NONE 1 FALSE I INDIVIDUAL (18 AND OVER) NA F W Non-Hispanic 8 PERSONAL (HANDS) 1 FALSE I INDIVIDUAL (18 AND OVER) NA F B Non-Hispanic 9 PERSONAL (HANDS) 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Non-Hispanic 10 PERSONAL (HANDS) 1 FALSE I INDIVIDUAL (18 AND OVER) NA M B Non-Hispanic victim_county_resident NAME.y state county population median_income geometry 1 RESIDENT Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... 2 RESIDENT Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... 3 RESIDENT Census Tract 101.04, Davidson County, Tennessee 47 037 3002 84831 MULTIPOLYGON (((-86.9744 36... 4 RESIDENT Census Tract 101.04, Davidson County, Tennessee 47 037 3002 84831 MULTIPOLYGON (((-86.9744 36... 5 RESIDENT Census Tract 101.04, Davidson County, Tennessee 47 037 3002 84831 MULTIPOLYGON (((-86.9744 36... 6 RESIDENT Census Tract 101.04, Davidson County, Tennessee 47 037 3002 84831 MULTIPOLYGON (((-86.9744 36... 7 RESIDENT Census Tract 101.04, Davidson County, Tennessee 47 037 3002 84831 MULTIPOLYGON (((-86.9744 36... 8 RESIDENT Census Tract 101.05, Davidson County, Tennessee 47 037 4839 61115 MULTIPOLYGON (((-86.89144 3... 9 RESIDENT Census Tract 101.05, Davidson County, Tennessee 47 037 4839 61115 MULTIPOLYGON (((-86.89144 3... 10 RESIDENT Census Tract 101.05, Davidson County, Tennessee 47 037 4839 61115 MULTIPOLYGON (((-86.89144 3... incident_difference 1 10620 secs 2 352320 secs 3 58500 secs 4 6420 secs 5 425220 secs 6 19500 secs 7 11160 secs 8 6600 secs 9 6780 secs 10 40980 secs |
| ```r |
| burglaries_combined <- burglaries_combined %>% mutate(incident_difference_hrs = incident_difference/(60*60)) burglaries_combined %>% head(2) ``` |
Simple feature collection with 2 features and 46 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -86.91752 ymin: 36.29296 xmax: -86.80667 ymax: 36.39049 Geodetic CRS: NAD83 TRACTCE STATEFP COUNTYFP GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON primary_key incident_number 1 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230053195_11 20230053195 2 010103 47 037 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 20230486382_11 20230486382 report_type report_type_description incident_status_code incident_status_description investigation_status incident_occurred incident_reported 1 D DISPATCHED O OPEN Open 2023-01-25 21:50:00 2023-01-26 00:47:00 2 D DISPATCHED O OPEN Open 2023-08-13 12:30:00 2023-08-17 14:22:00 incident_location zip_code location_code location_description offense_number offense_nibrs offense_description weapon_primary weapon_description 1 MORGAN RD NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed 2 CLAY LICK NA 22 RESIDENCE, HOME 1 220 BURGLARY- AGGRAVATED 16 Unarmed victim_number domestic_related victim_type victim_description mapped_location victim_gender victim_race victim_ethnicity victim_county_resident 1 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Hispanic RESIDENT 2 1 FALSE I INDIVIDUAL (18 AND OVER) NA M W Non-Hispanic RESIDENT NAME.y state county population median_income geometry incident_difference 1 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... 10620 secs 2 Census Tract 101.03, Davidson County, Tennessee 47 037 2411 60000 MULTIPOLYGON (((-86.91752 3... 352320 secs incident_difference_hrs 1 2.95000 secs 2 97.86667 secs |
| ```r incident_times<- burglaries_combined |> st_drop_geometry() |> group_by(TRACTCE) %>% distinct(incident_difference_hrs) %>% arrange(desc(incident_difference_hrs)) |
| ``` |
| ```r ggplot(incident_times, aes(x=TRACTCE, y=incident_difference_hrs)) + geom_point(color=‘darkblue’, fill=‘black’) |
| ``` |
| ```r |
| burglaries_combined |> st_drop_geometry() |> group_by(TRACTCE) %>% distinct(incident_difference_hrs) %>% arrange(incident_difference_hrs) ``` |
r NA |
r burglaries_combined |> st_drop_geometry() |> group_by(weapon_description) |> count(name = "weapons") |> arrange(desc(weapons)) |
r burglaries_combined |> st_drop_geometry() |> group_by(victim_type) |> count(name = "victim") |> arrange(desc(victim)) |
r burglaries_combined |> st_drop_geometry() |> group_by(victim_number) |> count(name = "victim") |> arrange(desc(victim)) |
r burglaries_combined |> ggplot() + geom_sf() |
r burglaries_combined |> ggplot() + geom_sf(aes(fill = TRACTCE)) |
| Need to pop the map out to see it… |
r ggplot(burglaries_combined, aes(x=zip_code)) + geom_histogram(bin=30, color='black', fill='blue') |
Warning: Ignoring unknown parameters: `bin` |
| Aggregate the data by census tract. Warning: each incident can appear multiple times if there are multiple victims, be sure that you aren’t double-counting any incidents. |
| #Use group-by nunique?(use distinct, also it must go after groupby) #filter by unique incident_number, then aggregate by TRACTCE |
| Initial Aggregation: (before aggregating out individual incidents) #note that some drop the geometry, using st_drop_geometry() |> not included here. |
r burglaries_combined |> group_by(TRACTCE) |> count(name = "count") |> arrange(desc(count)) |
Simple feature collection with 174 features and 2 fields Geometry type: GEOMETRY Dimension: XY Bounding box: xmin: -87.0547 ymin: 35.96779 xmax: -86.51559 ymax: 36.4055 Geodetic CRS: NAD83 |
| Secondary Aggregation: (after aggregating out individual incidents) Burglaries aggregated by incident number… #note that some suggest to drop the geometry, using st_drop_geometry() |> not included here. (You could take this and then make another one that subsets out the population and knit it together…) |
r burglaries_count_per <- burglaries_combined |> group_by(TRACTCE) |> distinct(incident_number) |> count(name = "incident_number") |> arrange(desc(incident_number)) burglaries_count_per |
| Incident_number counts |
r incidents_agg_count <- burglaries_combined |> group_by(incident_number) |> count(name = "count") |> arrange(desc(count)) incidents_agg_count |
Simple feature collection with 895 features and 2 fields Geometry type: GEOMETRY Dimension: XY Bounding box: xmin: -87.0547 ymin: 35.96779 xmax: -86.51559 ymax: 36.4055 Geodetic CRS: NAD83 |
| looking at the count of incident numbers graphically. |
| ```r ggplot(incidents_agg_count, aes(x=count)) + geom_bar( color=‘black’, fill=‘blue’) |
| ``` |
| Which census tract had the highest number of burglaries? (016000) |
| Which census tract had the highest number of burglaries per 1000 residents? (016000) (unless you count the two tracts with no population, then 980100…) #to get this number we have to subset number of incidents by population. Perhaps this is best done with a separate column… |
| #What was I doing here? Code no longer runs…trying to get the population… and the tract |
| ```r #burglaries_pop <- burglaries_combined |> # select(TRACTCE, population)|> # group_by(TRACTCE) |> # distinct(TRACTCE) |
| burglaries_pop <- burglaries_combined |> st_drop_geometry() |> dplyr::select(TRACTCE, population) |> group_by(TRACTCE) |> distinct(population) ``` |
r burglaries_combined |> group_by(TRACTCE) |> distinct(incident_number) |> count(name = "incident_number") |> arrange(desc(incident_number)) |
| ```r burglaries_joined <- merge(burglaries_count_per, burglaries_pop) |
| ``` |
r burglaries_joined <- burglaries_joined %>% mutate(burglaries_per_capita = population/incident_number) %>% arrange(burglaries_per_capita) burglaries_joined |
| We’re interested in the relationship between median income and number of aggravated burglaries, so examine those variables on their own and together to see what you can find. You may want to perform additional calculations, create plots, etc |
| ```r ggplot(burglaries_joined, aes(x=population, y=burglaries_per_capita)) + geom_point(color=‘darkblue’, fill=‘black’) |
| ``` |
r burglaries_median <- burglaries_combined |> dplyr::select(TRACTCE, median_income, geometry) |> group_by(TRACTCE) |> distinct(median_income) burglaries_median |
| ```r burglaries_joined <- merge(burglaries_joined, burglaries_median) |
| burglaries_joined %>% arrange(desc(incident_number)) ``` |
| (need to subset some of the data here. There are several observations that are -666666666 in median income…) |
r burglaries_joined<- burglaries_joined %>% filter(median_income > 0) %>% arrange(desc(median_income)) |
| The graph below looks to be sort of linear, in a negative fashion…maybe fit a linear regression to it? |
| ```r burgs_lm <- lm(“incident_number ~ median_income”, data = burglaries_joined) |
| summary(burgs_lm) ``` |
| ``` |
| Call: lm(formula = “incident_number ~ median_income”, data = burglaries_joined) |
| Residuals: Min 1Q Median 3Q Max -5.758 -3.155 -1.317 2.012 30.565 |
| Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.918e+00 8.856e-01 11.199 < 2e-16 median_income -6.781e-05 1.178e-05 -5.757 3.97e-08 |
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.922 on 168 degrees of freedom Multiple R-squared: 0.1648, Adjusted R-squared: 0.1598 F-statistic: 33.15 on 1 and 168 DF, p-value: 3.973e-08
<!-- rnb-output-end -->
<!-- rnb-chunk-end -->
<!-- rnb-text-begin -->
<!-- rnb-text-end -->
<!-- rnb-chunk-begin -->
<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuXG5nZ3Bsb3QoYnVyZ2xhcmllc19qb2luZWQsIGFlcyggeD1tZWRpYW5faW5jb21lLCB5PWluY2lkZW50X251bWJlcikpICtcbiAgZ2VvbV9wb2ludChjb2xvcj0nZGFya2JsdWUnKVxuXG5gYGAifQ== -->
```r
ggplot(burglaries_joined, aes( x=median_income, y=incident_number)) +
geom_point(color='darkblue')
(Map with burglaries. Darker are more incidents.)
incidents_agg_count %>%
ggplot() +
geom_sf(aes(fill = incident_number))
NA
incidents_agg_count %>%
head(2)
Simple feature collection with 2 features and 2 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -86.99052 ymin: 36.0076 xmax: -86.56665 ymax: 36.34359
Geodetic CRS: NAD83
Part 3 - Statistical Modeling Fit a Poisson regression model with target variable the rate of burglaries per census tract and with predictor the median income. Offset using the log of the population so that we are looking at the rate of burglaries per population instead of the number of burglaries. How can you interpret the meaning of the output? How do the estimates from the model compare to the observed data?
target variable the rate of burglaries per census tract #I assume this means we are using time? #In the example the ‘rate’ was just the count of the visits #Ran here without a time component, just using number of unique incidents per tract
burglaries_joined |>
ggplot(aes(x = incident_number)) +
geom_bar(color='black', fill='blue')
pois_num_0 <- glm('incident_number ~ 1',
data = burglaries_joined,
family = poisson)
summary(pois_num_0)
Call:
glm(formula = "incident_number ~ 1", family = poisson, data = burglaries_joined)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6688 0.0333 50.12 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 695.55 on 169 degrees of freedom
Residual deviance: 695.55 on 169 degrees of freedom
AIC: 1235.4
Number of Fisher Scoring iterations: 5
Above is the constant only model. Note the AIC is quite high,
pois_num_1 <- glm('incident_number ~ median_income',
data = burglaries_joined,
family = poisson)
summary(pois_num_1)
Call:
glm(formula = "incident_number ~ median_income", family = poisson,
data = burglaries_joined)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.855e+00 9.118e-02 31.31 <2e-16 ***
median_income -1.958e-05 1.546e-06 -12.66 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 695.55 on 169 degrees of freedom
Residual deviance: 494.75 on 168 degrees of freedom
AIC: 1036.6
Number of Fisher Scoring iterations: 5
Poisson Notes: Coefficients (To get the estimates for something in a poisson, you have to exponentiate the results to get the average…)
The intercept represents the expected log-count of burglaries when all other predictor variables are zero. In this case, when all other predictor variables are zero, the expected log-count of burglaries is approximately 2.855. Since this is a log-count, to interpret it in terms of count, you would exponentiate it. So, exp(2.855) gives you the expected count of burglaries.
For a one unit change in median_income, the difference in the logs of expected counts of burglaries is expected to change by -1.958e-05 Or, for every dollar increase in median income, burglaries are expected to decrease by 0.00001958.
A good way to think about this is by 0.00001958. and times by 10k…
Oftentimes the poisson results are converted to ‘incident response ratios’ (IRR) this is done by taking the exponential of the result or exp(-1.958e-05) = .99998402. This represents the multiplicative change in the expected count for a one-unit increase in the predictor. So, another way to interpret this is to say that for every one-unit increase in median income, the IRR decreases by 0.99998042, or a little less than one percent. (that can’t be right? unless it goes to 10000 and then starts over?)
With large p values, we can say that both variables are statistically significant.
Note to that the std errors are small…
AIC=Stands for Akaike Information Criterion. Basically, the lower the AIC, the better the model fit. Compare to the other 3 pois_reg for fit comparisons…
Number of Fisher Scoring iterations is a count of how long it took to fit the model. Safely ignore…
Let’s see what it estimates for the mean number of the distribution:
mu = exp(coef(pois_num_1 ))
And plot the result. (IMPORTANT! these are means…)
x <- 0:39
y <- dpois(x, lambda = mu)
tibble(x = x, y = y) |>
ggplot(aes(x = x, y = y)) +
geom_col(color='black', fill='green')
So just so we’re clear here, this means there are 40 tracts with 1 incident report, and 24 tracts with 2…1 tract with 39. And, in the above graph we’re taking the means of all of those as a percentage of the total.
burglaries_joined|>
count(incident_number)
An example poisson dist for comparison
median_income <- seq(from = 20000, to = 180000, length.out = 5)
map(median_income,
\(x) tibble(median_income = x,
num_burglaries = 0:20,
probability = dpois(0:20,
lambda = predict(pois_num_2,
newdata = tibble(median_income = x, population = 4000), type = "response")
)
)
) |>
bind_rows() |>
ggplot(aes(x = num_burglaries, y = probability)) +
geom_col() +
facet_wrap(~median_income)
Offset using the log of the population so that we are looking at the rate of burglaries per population instead of the number of burglaries
burglaries_joined <- burglaries_joined |>
mutate(poplog = log10(population))
pois_num_2 <- glm('incident_number ~ median_income',
data = burglaries_joined,
family = poisson,
offset = log(population))
summary(pois_num_2)
Call:
glm(formula = "incident_number ~ median_income", family = poisson,
data = burglaries_joined, offset = log(population))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.226e+00 9.670e-02 -54.04 <2e-16 ***
median_income -2.326e-05 1.654e-06 -14.06 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 764.21 on 169 degrees of freedom
Residual deviance: 517.12 on 168 degrees of freedom
AIC: 1058.9
Number of Fisher Scoring iterations: 5
(Reminder: Degrees of Freedom are the maximum number of logically independent values, which may vary, in a data sample)
For a one unit change in median_income, the difference in the logs of expected counts of burglaries is expected to change by -1.958e-05 Or, for every dollar increase in median income, burglaries are expected to decrease by 0.00002326 in the log of expected counts. #Note that the results are a little different, even though they should be the same…
pois_num_3 <- glm('incident_number ~ median_income',
data = burglaries_joined,
family = poisson,
offset = poplog)
summary(pois_num_3)
Call:
glm(formula = "incident_number ~ median_income", family = poisson,
data = burglaries_joined, offset = poplog)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.384e-01 9.366e-02 -6.816 9.35e-12 ***
median_income -2.120e-05 1.595e-06 -13.294 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 698.89 on 169 degrees of freedom
Residual deviance: 477.84 on 168 degrees of freedom
AIC: 1019.7
Number of Fisher Scoring iterations: 5
Compare model fit using deviance. The lower the number, the better…Here we can say that the deviance score between pois_num_2 and pois_num_3 dropped by 39.285. Thusly, model# 3 fit better.
drop_in_dev<- anova(pois_num_2, pois_num_3, test = "Chisq")
drop_in_dev
Analysis of Deviance Table
Model 1: incident_number ~ median_income
Model 2: incident_number ~ median_income
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 168 517.12
2 168 477.84 0 39.285
burglaries_joined |>
mutate(group = ntile(median_income, n = 10)) |>
group_by(group) |>
summarize(
median_income = median(median_income),
median_incident_number = median(incident_number)
)
burglaries_joined |>
mutate(group = ntile(median_income, n = 10)) |>
group_by(group) |>
ggplot(aes(x = incident_number)) +
geom_bar(color='black', fill='darkred') +
facet_wrap(~group)
burglaries_joined<- burglaries_joined |>
mutate(group = ntile(median_income, n = 5))
mu3 = exp(coef(pois_num_3 ))
And plot the result. (these are means…)
x <- 0:35
y <- dpois(x, lambda = mu3)
tibble(x = x, y = y) |>
ggplot(aes(x = x, y = y)) +
geom_col(color='black', fill='orange')
ggplot(burglaries_joined, aes( x=population, y=incident_number, color='brown')) +
geom_point()
est_df <- tibble(
median_income = seq(from = min(burglaries_joined$median_income, na.rm = TRUE),
to = max(burglaries_joined$median_income, na.rm = TRUE),
length.out = 100),
population = seq(from = min(burglaries_joined$population, na.rm = TRUE),
to = max(burglaries_joined$population, na.rm = TRUE),
length.out = 100), )
est_df <- est_df %>%
bind_cols(predict(pois_num_2, newdata = est_df, type = "response")) %>%
rename("Estimated Mean Number of Incidents" = "...3")
New names:
ggplot(est_df) +
geom_line(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_point(data = burglaries_joined, aes(x=median_income, y=incident_number))
NA
est_df %>%
ggplot(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_line(color = 'green')
dispersion parameter is high in the below model, indicating overdispersion (= extra variance…) It’s interesting that graphically the fit is almost the same as the standard poisson model. Intuition says there is likely a better fit with negative binomial…
pois_num_5 <- glm('incident_number ~ median_income',
data = burglaries_joined,
family = quasipoisson(),
offset = log(population))
summary(pois_num_5)
Call:
glm(formula = "incident_number ~ median_income", family = quasipoisson(),
data = burglaries_joined, offset = log(population))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.226e+00 1.968e-01 -26.549 < 2e-16 ***
median_income -2.326e-05 3.367e-06 -6.909 9.65e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 4.143165)
Null deviance: 764.21 on 169 degrees of freedom
Residual deviance: 517.12 on 168 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
est_df_1 <- tibble(
median_income = seq(from = min(burglaries_joined$median_income, na.rm = TRUE),
to = max(burglaries_joined$median_income, na.rm = TRUE),
length.out = 100),
population = seq(from = min(burglaries_joined$population, na.rm = TRUE),
to = max(burglaries_joined$population, na.rm = TRUE),
length.out = 100), )
est_df_1 <- est_df_1 %>%
bind_cols(predict(pois_num_5, newdata = est_df_1, type = "response")) %>%
rename("Estimated Mean Number of Incidents" = "...3")
New names:
ggplot(est_df_1) +
geom_line(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_point(data = burglaries_joined, aes(x=median_income, y=incident_number))
NA
est_df_1 %>%
ggplot(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_line(color = 'orange')
Below is the negative binomial. Note that it gives a significantly lower AIC score!
neg_bin_1 <- glm.nb('incident_number ~ median_income',
data = burglaries_joined)
summary(neg_bin_1)
Call:
glm.nb(formula = "incident_number ~ median_income", data = burglaries_joined,
init.theta = 2.78080513, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.644e+00 1.530e-01 17.282 < 2e-16 ***
median_income -1.611e-05 2.281e-06 -7.064 1.62e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(2.7808) family taken to be 1)
Null deviance: 231.66 on 169 degrees of freedom
Residual deviance: 168.05 on 168 degrees of freedom
AIC: 872.94
Number of Fisher Scoring iterations: 1
Theta: 2.781
Std. Err.: 0.461
2 x log-likelihood: -866.937
est_df_2 <- tibble(
median_income = seq(from = min(burglaries_joined$median_income, na.rm = TRUE),
to = max(burglaries_joined$median_income, na.rm = TRUE),
length.out = 100),
population = seq(from = min(burglaries_joined$population, na.rm = TRUE),
to = max(burglaries_joined$population, na.rm = TRUE),
length.out = 100), )
est_df_2 <- est_df_2 %>%
bind_cols(predict(neg_bin_1, newdata = est_df_2, type = "response")) %>%
rename("Estimated Mean Number of Incidents" = "...3")
New names:
ggplot(est_df_2) +
geom_line(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_point(data = burglaries_joined, aes(x=median_income, y=incident_number))
NA
est_df_2 %>%
ggplot(aes(x = median_income, y = `Estimated Mean Number of Incidents`)) +
geom_line(color = 'orange')
Comparing goodness of fit with deviance scores, using the Chi squared test. a deviance residual describes how the observed data deviates from the fitted model. Note the VERY large drop in deviance. Almost 310 pts. This is a significant increase in model fit!
drop_in_dev<- anova(pois_num_3, neg_bin_1, test = "Chisq")
drop_in_dev
Analysis of Deviance Table
Model 1: incident_number ~ median_income
Model 2: incident_number ~ median_income
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 168 477.84
2 168 168.05 0 309.78